This assignment has three parts:
The assignment folder has a number of files and folders. They relate to the parts of the assignment as follows:
- This notebook (which you've apparently already found)
- Part1_Amazon_Reviews:
- Part2_Cat_Classifying:
- Part3_George_W_Bush:
Note that throughout this notebook we're going to be using various packages that you may not have installed. If you encounter an error using one, check to make sure you've installed it. If you haven't, it should be easy to do so with conda.
In this part of the assignment you'll use text from reviews of products on amazon.com to predict what overall score the reviewer gave the product. The dataset is from here. There were lots of different datasets to choose from, but we felt like going with beauty products this time to do something a little different.
There are three datasets within the Part1_Amazon_Reviews folder, one each for training, validation, and testing. There are 198502 reviews total across the dataset, and we've somewhat arbitrarily separated them into roughly 60% for training, 30% for validation, and 10% for testing. The exact percentages don't matter too much here.
Take a look at what's inside the data. You can open it in a browser or a text editor. Both should show you tons of text. If you haven't seen .json data before, this is what it looks like. Each review object starts with a { and ends with a }, and contains a bunch of features separated by commas. Each feature has a name and a value.
1.1 (2 points): In 13 words or less per feature, explain what you think each feature in the dataset means in the box below.
print (len(("reviewerID is the guid representing the user providing the review").split()) < 14)
print (len(("asin is the Amazon product ID for the product being reviewed").split()) < 14)
print (len(("reviewerName is the user's chosen name displayed on the review").split()) < 14)
print (len(("helpful is a tuple of other users' votes determining helpfulness").split()) < 14)
print (len(("reviewText is the actual content of the review").split()) < 14)
print (len(("overall is the star rating from 1-5 provided by the reviewer").split()) < 14)
print (len(("summary is the title of the review that is shown above the content").split()) < 14)
print (len(("unixReviewTime measures the time in Unix epochs for data storage and system logs").split()) < 14)
print (len(("reviewTime is the same as unixReviewTime but in human-readable format").split()) < 14)
# Keep going until you have all the features
The first thing we have to do is import the data. Normally json data is really easy to import, but this file is in a slightly annoying format. Notably, the reviews don't have commas separating them. Fortunately this is pretty easy to overcome with a bit of iterating. For this assignment, we only care about the review text and the overall score within each review. To save you a bit of googling, you can access individual items within a json object by doing object["featurename"].
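To make the format concrete, here's a minimal sketch of parsing one line of newline-delimited JSON. The review text and score below are invented for illustration; only the `reviewText` and `overall` field names come from the dataset.

```python
import json

# A hypothetical review line in the same shape as the dataset files
# (the values are made up; the field names match the real data)
line = '{"reviewText": "Great product, would buy again.", "overall": 5.0}'

review = json.loads(line)      # parse one line into a Python dict
print(review["reviewText"])    # -> Great product, would buy again.
print(review["overall"])       # -> 5.0
```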
1.2 (3 points): Import data into X and y arrays for train, validation and test sets.
We'll give you the imports:
import json
import numpy as np
import os
# Put your resulting data in numpy arrays named:
# "Train_X", "Train_y", "Validation_X", "Validation_y", "Test_X", and "Test_y"
# with X as review text and y as overall score
# Your code goes here
# Test data
Test_X = []
Test_y = []
with open("Part1_Amazon_Reviews/beauty_test.json") as test_data_all:
    for line in test_data_all:
        review = json.loads(line)   # parse each line once
        Test_X.append(review["reviewText"])
        Test_y.append(review["overall"])
Test_X = np.asarray(Test_X)
Test_y = np.asarray(Test_y)

# Training data
Train_X = []
Train_y = []
with open("Part1_Amazon_Reviews/beauty_train.json") as train_data_all:
    for line in train_data_all:
        review = json.loads(line)
        Train_X.append(review["reviewText"])
        Train_y.append(review["overall"])
Train_X = np.asarray(Train_X)
Train_y = np.asarray(Train_y)

# Validation data
Validation_X = []
Validation_y = []
with open("Part1_Amazon_Reviews/beauty_validation.json") as valid_data_all:
    for line in valid_data_all:
        if line.strip() != '':      # skip any blank lines
            review = json.loads(line)
            Validation_X.append(review["reviewText"])
            Validation_y.append(review["overall"])
Validation_X = np.asarray(Validation_X)
Validation_y = np.asarray(Validation_y)
# If you've done this right, you should get (125398,) (125398,) (63104,) (63104,) (10000,) (10000,)
print (Train_X.shape, Train_y.shape, Validation_X.shape, Validation_y.shape, Test_X.shape, Test_y.shape)
We're going to test how good a bunch of models are at predicting scores on this dataset. It would be cheating to keep training on the training set and testing on the test set, as you could just end up finding the model that best fits that specific data rather than one that generalizes to data you haven't yet seen.
Thus, we'll be using cross-validation. You've already seen a bit of this in A2, but here is an explanation.
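As a rough sketch of what five-fold cross-validation does under the hood (cross_val_score handles all of this for you; this toy version just splits indices by hand):

```python
import numpy as np

indices = np.arange(10)              # ten toy samples
folds = np.array_split(indices, 5)   # five folds of two samples each
for i, fold in enumerate(folds):
    # each fold takes one turn as the held-out set; the rest train
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("fold", i, "held out:", fold, "train on:", train)
```

Every sample is held out exactly once, so each of the five scores comes from data that model didn't train on.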
Before we get to that though, we're going to build a pipeline to prepare the text data to be classified and then build a classifier all in one step. Here is a rather lengthy explanation of what pipelines are. People have tons of different syntactical styles in writing pipelines, but we'll be using the syntax at 14:30 in the video above. It'll look something like this:
# Don't run this, it's just an example. The semicolon at the end isn't part of the pipeline.
"""
model = Pipeline([
    ('name of transform1', Transform1()),
    ('name of transform2', Transform2()),
    ('name of classifier', Classifier()),
])
model.fit(X, y)
"""
;
So what's happening here? "model" is a Pipeline object. It's defined in the "model = ..." block, and used with model.fit(X, y). It works by taking whatever you put into it, in this case X and y, and running them through each step in the Pipeline. Typically this means each step until the last will transform the data in some way, and the last step will fit a classifier to the transformed data. Once a Pipeline is fit, you can use it by calling one of the methods of the final step in the pipeline. For example, in order to call model.predict(X), the last step in the Pipeline needs to be a classifier, as transformers don't have a .predict() method.
Note that I'm using Pipeline rather than pipeline above, because these things all apply specifically to the Pipeline class in sklearn, but not necessarily to the broader concept of an ML pipeline.
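To see the equivalence concretely, here's a tiny sketch (the documents and labels are invented for illustration) of a Pipeline next to the same steps done by hand:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy data, made up for illustration
X = ["good product", "bad product", "good value", "bad value"]
y = [1, 0, 1, 0]

# Pipeline version: one fit call runs every step in order
model = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
model.fit(X, y)
predictions = model.predict(X)

# The same thing by hand: each transformer transforms, the classifier fits last
counts = CountVectorizer().fit_transform(X)
tfidf = TfidfTransformer().fit_transform(counts)
clf = MultinomialNB().fit(tfidf, y)
```

The Pipeline version also remembers the fitted vocabulary, so model.predict on new text reuses it automatically instead of you having to carry the fitted transformers around.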
What are the advantages of using a Pipeline? Well:
Convinced yet? If not, that's cool. You're gonna do it anyways. In this pipeline you're going to have three steps. A CountVectorizer transformer, a TfidfTransformer, and a classifier. You'll play with a few different classifiers to see which one works best.
1.3 (2 points) Look up the documentation for CountVectorizer and TfidfTransformer. For each, explain what it's going to do to your data in fewer than 30 words.
print (len(("CountVectorizer builds a vocabulary from the documents and counts how many times each word appears in each one").split()) < 31)
print (len(("TfidfTransformer reweights those counts by term frequency times inverse document frequency, downweighting words that appear in many documents").split()) < 31)
Next, make a Pipeline. The first two steps will be the above two transformers. The third step will be a classifier. We'll be testing out five classifiers here: KNeighborsClassifier, MultinomialNB, LogisticRegression, RandomForestClassifier, and an additional one of your choice. Just put one in for now; we'll swap them in and out later.
1.4 (6 points) Build a Pipeline to do the above using the format shown in the model = Pipeline([ example above.
We'll let you figure out the imports this time.
# Your code here
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Uncomment one pipeline at a time to test each classifier

# Pipeline with MultinomialNB
model = Pipeline([
    ('countvect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('mnb', MultinomialNB()),
])

# Pipeline with KNN
'''
model = Pipeline([
    ('countvect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('knn', KNeighborsClassifier()),
])
'''
# Pipeline with Logistic Regression
'''
model = Pipeline([
    ('countvect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('lr', LogisticRegression()),
])
'''
# Pipeline with Random Forest
'''
model = Pipeline([
    ('countvect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('rf', RandomForestClassifier()),
])
'''
# Pipeline with Decision Tree
'''
model = Pipeline([
    ('countvect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('dt', DecisionTreeClassifier()),
])
'''
Next we're going to do some cross validation to test out how well each of the classifiers does on the data.
1.5a (6 points) In the cell below, run five-fold cross validation using each of the five classifiers from above using the cross_val_score function in sklearn.model_selection. For each classifier, output both the accuracy score and the weighted f1 score. Write down the averages of each of these over the five folds in the slots below.
You can read about f1 scores in the cross_val_score documentation. Basically f1 is a complement to accuracy that incorporates both precision and recall.
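Concretely, f1 is the harmonic mean of precision (what fraction of predicted positives were right) and recall (what fraction of actual positives were found). A minimal check with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]   # three actual positives
y_pred = [1, 1, 0, 1, 0, 0]   # three predicted positives, two of them correct

p = precision_score(y_true, y_pred)   # 2/3 of predicted positives are right
r = recall_score(y_true, y_pred)      # 2/3 of actual positives were found
f = f1_score(y_true, y_pred)
print(p, r, f)                        # f is the harmonic mean: 2*p*r/(p+r)
```

The 'weighted' variant computes f1 per class (per star rating here) and averages them, weighting each class by how many true examples it has.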
NOTE: Make sure you're doing cross validation on your Validation datasets! To use the Train or Test sets defeats the whole purpose.
# You'll need to import four things
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
# Your code goes here
# Five-fold cross-validation on the validation set, scored both ways
accuracies = cross_val_score(model, Validation_X, Validation_y, cv=5, scoring='accuracy')
f1s = cross_val_score(model, Validation_X, Validation_y, cv=5, scoring='f1_weighted')
print(accuracies.mean())
print(f1s.mean())
# Write in your average values below. Three digits is fine:
print("KNeighborsClassifier_AvgAccuracy = 0.408")
print("KNeighborsClassifier_AvgF1 = 0.307")
print("MultinomialNB_AvgAccuracy = 0.578")
print("MultinomialNB_AvgF1 = 0.424")
print("LogisticRegression_AvgAccuracy = 0.653")
print("LogisticRegression_AvgF1 = 0.605")
print("RandomForestClassifier_AvgAccuracy = 0.585")
print("RandomForestClassifier_AvgF1 = 0.496")
print("YourPick_AvgAccuracy = 0.504")
print("YourPick_AvgF1 = 0.496")
PART 1.5 OPTIONAL BONUS (3 points or 5 points): 3 points: Build a Pipeline that ends up getting you greater than 0.75 accuracy and greater than 0.65 weighted f1 score. You can use different or additional transformers and whatever classifier you want to use from anywhere, not necessarily just sklearn. Additional 2 points: get greater than 0.80 accuracy and greater than 0.7 weighted f1 score. This is going to be HARD, but there are definitely ways to do this classification far better than we just did.
# Cell for optional part 1.5 bonus code
#cleaning validation array to remove blank inputs
'''
newValidation_X = []
newValidation_y = []
with open("Part1_Amazon_Reviews/beauty_validation.json") as valid_data_all:
    for line in valid_data_all:
        review = json.loads(line)
        if review["reviewText"] != '':
            newValidation_X.append(review["reviewText"])
            newValidation_y.append(review["overall"])
print (len(Validation_X))
print (len(newValidation_y))

model = Pipeline([
    ('countvect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('lr', LogisticRegression()),
])
results = cross_val_score(model, newValidation_X, newValidation_y)
print (results.mean())
'''
Finally it's time to see how the model really performs on test data.
1.6 (6 points) Pick the classifier that performed best above, then fit a model on your training data and predict on your test data. Print out your accuracy and f1 scores, and then print out the confusion matrix.
# Your code goes here
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
model = Pipeline([
    ('countvect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('lr', LogisticRegression()),
])
model.fit(Train_X, Train_y)
predictedvalues=model.predict(Test_X)
print("Accuracy Score: ")
print (accuracy_score(Test_y, predictedvalues))
print ("f1 Score: ")
print (f1_score(Test_y, predictedvalues, average='weighted'))
print ("confusion matrix: ")
print (confusion_matrix(Test_y, predictedvalues))
Congrats! You've made a halfway-decent text classifier that does something vaguely useful!
The next thing we're going to do is try to see if we can create a classifier to tell the difference between images of cats and images of non-cats. We'll be using the pleasantly-named "Pillow" fork of the Python Imaging Library (PIL).
Starring in the cat images portion of this dataset are Taro, Teru, Rhubarb Penelope (Ruby), Mallow, and Hubble, all of whom are members of the 4th year HCII PhD cohort:
# Just run this. Also you can probably get some clues from this for your next steps.
# We figured it was worth giving away answers to show pictures of these cats.
import PIL
from PIL import Image
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(32, 32))
cat_names = ["Taro", "Teru", "Ruby", "Mallow", "Hubble"]
for i, name in enumerate(cat_names):
    img = Image.open('Part2_Cat_Classifying/' + name + '.jpg')
    sp = plt.subplot(5, 1, i + 1)
    sp.set_title(name)
    plt.imshow(img)
The non-cat images are a subset of the OASIS image dataset by Benedek Kurdi, explained here. We've removed images from this dataset that show people or animals, just to try to make classification easier.
Because we're only using a subset of the OASIS dataset, please don't use these images for other projects. The full dataset is available for free download at the above link if you want to use it for another purpose.
Example scenery image:
# Just run this
img = Image.open('Part2_Cat_Classifying/Lake5.jpg')
plt.imshow(img)
Unfortunately, the formal sklearn Pipeline function isn't actually terribly useful for this task. However, as it turns out, "pipeline" is a broader term that really just means the steps you do to import your data, clean it, visualize it, select features, model, and analyze results. Sklearn's Pipeline is a very common way to do this for some types of data, but you can do it manually for others (or you can write out functions to create your own version of Pipeline for a particular application).
Keep the structure of Pipelines in the back of your mind, though. We'll come back to it.
Before we formally build our pipeline, let's just take a look at the data a bit and see what we can do with it. Here's a picture of Teru. You can see the dimensions of the picture (1920 x 2560) printed above it.
# Just run this
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
terupicture = Image.open('Part2_Cat_Classifying/cats/1.jpg')
plt.imshow(terupicture)
terupicture.size
Next we can take a look at what's actually stored within the data of this image. If you don't know how pixels work, here is a basic explanation. It's strikingly difficult to find a really clear tutorial of this online, but this is the best we found after a lot of googling. If this video is to be believed, the above cat picture should be represented by a whole lot of pixels (1920 times 2560, in fact) which are each made of three values from zero to 255 (representing Red, Green, and Blue respectively). This should look something like [12, 127, 56], [65, 208, 11], [34, 33, 2], ....
This notebook will just show you a limited subset of these (as printing out ~5 million would be a bit much).
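If you want to see this structure on something smaller than five million pixels, here's a hypothetical 2x2 image built by hand:

```python
import numpy as np

# A made-up 2x2 RGB image: each pixel is [R, G, B], each channel 0-255
pixels = np.array([[[255, 0, 0], [0, 255, 0]],
                   [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

print(pixels.shape)   # (2, 2, 3): height, width, color channels
print(pixels[0, 0])   # the top-left pixel is pure red
```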
teru_image_array = np.array(terupicture)
print(teru_image_array)
Voila, just as we expected.
For the sake of image processing, let's convert this image to grayscale. Again, we'll expect to see values from 0 to 255, but this time it'll be a list of single numbers rather than groups of three.
2.1 (3 points) Convert the image of Teru to grayscale. Show the new image, and then print out the array of numbers like we did above.
HINT: You don't need to (and shouldn't) import anything else for this. There's already a function in something we've imported that will convert images to grayscale.
# Your code goes here
terupicture = terupicture.convert('L')
plt.imshow(terupicture, cmap=plt.cm.gray)   # without cmap, matplotlib false-colors grayscale
print (terupicture.size)
print (np.array(terupicture))
Again, this looks as expected. We can do some random other things with PIL like rotating the image or flipping it or cropping it.
2.2a (1 point) Display a grayscale picture of Teru rotated 20 degrees counterclockwise
2.2b (1 point) Display a grayscale picture of Teru with the bottom 1000 pixels cropped off
HINT: again, no imports needed. The rotated picture will have some black space at the corners now. That's fine.
# Rotate here
terupicturerot = terupicture.rotate(20)
plt.imshow(terupicturerot, cmap=plt.cm.gray)
# Crop here
w, h = terupicture.size
print (w, h)
terupicturecrop = terupicture.crop((0, 0, w, h - 1000))
plt.imshow(terupicturecrop, cmap=plt.cm.gray)
terupicturecrop.size
Image "Convolutions" are ways to do some math with the pixel values in images to transform them into something that's more useful for classification. This has a nice explanation of what convolutions are, and we'll use some of the ones they talk about here.
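Before applying kernels to the cat picture, here's a tiny sanity-check sketch: convolving with an identity kernel (all the weight on the center) should return the image unchanged.

```python
import numpy as np
from scipy.signal import convolve2d

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Identity kernel: each output pixel is just the input pixel under the center
identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]])

out = convolve2d(image, identity, mode='same')
print(out)   # identical to image
```

Note that convolve2d's default mode is 'full', which pads the result, so without mode='same' the output comes back slightly larger than the input.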
2.3 (2 points) Convolve your original gray Teru picture using the "sharpen" kernel they define, then display it.
HINT: See the import in the next cell.
from scipy.signal import convolve2d
kernel = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])
gray_terupicture_sharpen=convolve2d(terupicture,kernel)
plt.imshow(gray_terupicture_sharpen)
This'll look better if we equalize exposure. Don't worry too much about what this does; it basically just increases contrast.
#Just run this. There'll be a UserWarning. That's fine.
from skimage import exposure
gray_terupicture_sharpen_equalized = exposure.equalize_adapthist(gray_terupicture_sharpen/np.max(np.abs(gray_terupicture_sharpen)), clip_limit=0.03)
plt.imshow(gray_terupicture_sharpen_equalized, cmap=plt.cm.gray)
Next let's try the edge detection kernel they specify, also equalizing exposure as shown above.
2.4a (2 points) Convolve your ORIGINAL gray Teru picture using the edge detection kernel they define, then display it.
# Your code goes here
kernel = np.array([[-1,-1,-1],[-1,8,-1],[-1,-1,-1]])
gray_terupicture_edge_detection=convolve2d(terupicture, kernel)
gray_terupicture_edge_equalized = exposure.equalize_adapthist(gray_terupicture_edge_detection/np.max(np.abs(gray_terupicture_edge_detection)), clip_limit=0.03)
plt.imshow(gray_terupicture_edge_equalized, cmap=plt.cm.gray)
It kinda sorta found the edges. Looks like it got tricked by fur patterns though.
2.4b (1 point) Which one or two cats might edge processing work best on? Why?
print("Mallow and possibly Hubble would be good. For edge processing to work well, there needs to be good contrast in the photo and contrast only on the edge of the cat (so solid color fur cats work best)")
Finally, let's try a kernel that isn't on the page.
2.5 (2 points) Go to the kernels wikipedia page and pick one of the three blur kernels and apply it here.
# Your code goes here
#using box blur as shown here: http://machinelearninguru.com/computer_vision/basics/convolution/image_convolution_1.html
kernel = np.array([[1,1,1],[1,1,1],[1,1,1]])/9.0
gray_terupicture_blur=convolve2d(terupicture, kernel)
gray_terupicture_blur_equalized = exposure.equalize_adapthist(gray_terupicture_blur/np.max(np.abs(gray_terupicture_blur)), clip_limit=0.03)
plt.imshow(gray_terupicture_blur_equalized, cmap=plt.cm.gray)
Okay, I think you get the picture (pun intended).
Let's make a cat classifier. We'll follow the steps described above in the section where we talked about pipelines in image classification: import your data, clean it, visualize it, select features, model, and analyze results.
There are four arrays of data we need to import to get our classifier working, which we'll combine to get two final arrays that can be passed into a classifier:
Let's start with the first. The first thing you may notice if you scroll through the images in the "cats" folder is that some of the files are in .jpg format and a handful are .png.
Taro has classier owners than all of the other HCII cats, so some pictures of her are in .png format. For simplicity's sake, we want all of our images in .jpg format, so write a script to convert them to .jpg. Name the output something like "Taro1.jpg", "Taro2.jpg", etc. In order to do this, you're going to want to open each .png file in the folder individually in sequence and convert it to .jpg, saving it under a different name. I'll let you figure out how to do this, but note that the glob package makes it really easy to select all the files in a folder with a certain extension.
Note that you may get an error if you try to convert directly from .png to .jpg because .png images have a property (transparency) that .jpg files don't. We don't need this property here, so you can ignore the error if you get it.
2.6 (2 points) Convert the .png files in the cats folder to .jpg files and re-save them as new files
# Your code goes here
import glob
pngarray = glob.glob('Part2_Cat_Classifying/cats/*.png')
print (pngarray)
for i in range(len(pngarray)):
    im = Image.open(pngarray[i])
    im.convert('RGB').save('Part2_Cat_Classifying/cats/Taro' + str(i + 1) + '.jpg', 'JPEG')
# If you've done this right up until now, the following should output 301
catfilelist = glob.glob('Part2_Cat_Classifying/cats/*.jpg')
print (len(catfilelist))
Finally, you might notice that the images have lots of different sizes and dimensions represented. Let's convert all the images to the same dimensions. This'll make some of them a little oddly-proportioned, but that's ok.
2.7 (3 points) Import each of the .jpg files and resize them to 256x256 pixels. Then, convert these to grayscale, as we did for the cat image above. Re-save these images with names "phdcat0.jpg" through "phdcat300.jpg".
(Sorry for your disk space, though they'll be fairly small files after this).
# Your code goes here
for j in range(len(catfilelist)):
    im = Image.open(catfilelist[j]).resize((256, 256))
    im.convert('L').save('Part2_Cat_Classifying/cats/phdcat' + str(j) + '.jpg', 'JPEG')
If you did this right, the following should again print out 301:
# Just run this
catfilelist = glob.glob('Part2_Cat_Classifying/cats/phdcat*.jpg')
print (len(catfilelist))
(It's probably useful to remove the old .png files and original (color) .jpg images at this point, just in case you accidentally include them when you didn't mean to. You might want to save a copy of them somewhere in case you mess something up.)
Now it's finally time to import the images into an array you can use for classification.
2.8a (2 points) Put all of the phdcats images in a numpy array called "catimages".
HINT: this will actually be a numpy array of numpy arrays, where the large array contains an array for each of the pictures, and the picture arrays contain grayscale picture values.
# Your code goes here
catimages = []
for f in catfilelist:
    catimages.append(np.array(Image.open(f)))
catimages = np.asarray(catimages)
2.8b (1 point) What does the code below do? What do each of the numbers mean?
print(catimages.shape)
print("catimages.shape is (301, 256, 256): 301 images, each stored as a 256 x 256 array of grayscale pixel values")
print(catimages[0].shape)
print("catimages[0].shape is (256, 256): the dimensions of the first image in the array")
Let's take a look at what's in this new numpy array to make sure it makes sense:
# Just run this.
# This makes a nice 4x4 display to show images numbers 0-15 in the array,
# and the underlying numerical data for four of them
fig = plt.figure(figsize=(16, 16))
for i in range(1, 17):
    img = catimages[i - 1]
    fig.add_subplot(4, 4, i)
    plt.imshow(img, cmap=plt.cm.gray)
print (catimages[:4])
Status checkpoint:
You can do the exact same thing to clean and import the scenery pictures that you did for the cat pictures. Fortunately, they're all already .jpg files.
2.9 (3 points) Convert the scenery images to grayscale and 256x256, save them as "scenery0.jpg" through "scenery332.jpg" and then put them in a numpy array called "sceneryimages". Do this all in one cell below.
(Note that we're using the word "scenery" loosely here. They're just a bunch of images without animals or people in them)
# Your code goes here
# Get all scenery photos
sceneryfilelist = glob.glob('Part2_Cat_Classifying/scenery/*.jpg')
# Resize, convert to grayscale, and rename the scenery images
for j in range(len(sceneryfilelist)):
    im = Image.open(sceneryfilelist[j]).resize((256, 256))
    im.convert('L').save('Part2_Cat_Classifying/scenery/scenery' + str(j) + '.jpg', 'JPEG')
# Collect the converted photos
sceneryfilelist_converted = glob.glob('Part2_Cat_Classifying/scenery/scenery*.jpg')
# Put them in a numpy array
sceneryimages = []
for f in sceneryfilelist_converted:
    sceneryimages.append(np.array(Image.open(f)))
sceneryimages = np.asarray(sceneryimages)
# If you've done this correctly, this should print 333
print (len(sceneryimages))
You'll probably want to remove the original scenery .jpg images here too: either delete them or save them somewhere else.
Run the following to double check your array:
# This makes the same array for scenery images that we did above for cat images, just to take a look at what's there.
fig=plt.figure(figsize=(16, 16))
for i in range(1, 17):
img = sceneryimages[i-1]
fig.add_subplot(4, 4, i)
plt.imshow(img,cmap=plt.cm.gray)
print (sceneryimages[:4])
Status checkpoint:
Now we have all of our images of cats and non-cats cleaned and imported into numpy arrays. This will serve as the source for our Train_X, Validation_X, and Test_X sets once we smush the two arrays together. However, in order to train a classifier we need to give the classifier an array telling it which images are of cats and which are of non-cats (where 1 = cat, and 0 = non-cat). Since we've kept our two types of images separate, it's pretty easy to make an array of ones and an array of zeros and then smush those two arrays together.
Since we know the length of our catimages array and our sceneryimages array:
2.10a (1 point) Make a one-dimensional array full of ones as long as the catimages array.
2.10b (1 point) Make a one-dimensional array full of zeros as long as the sceneryimages array.
# Your code goes here
catimages_y = np.ones(len(catimages), dtype=int)
sceneryimages_y = np.zeros(len(sceneryimages), dtype=int)
# Should print 301 and 333:
print(len(catimages_y), len(sceneryimages_y))
Status checkpoint:
Now we need to smush the two images arrays together to get a final X array. One might call this process... conCATenating. ba dum tss
2.11 (2 points) Create a new array that concatenates all the cat images followed by all of the scenery images along the first axis
# Your code goes here
All_X=np.concatenate((catimages, sceneryimages))
# Should print (634, 256, 256)
print(All_X.shape)
Now smush the two y arrays together. Make sure you combine them in the correct order - if you put catimages first above, make sure to put catimages_y first here:
# Your code goes here
All_y=np.concatenate((catimages_y, sceneryimages_y))
# Should print (634,)
print(All_y.shape)
The following will show the first three, middle three, and last three images in your images array, along with the corresponding classifications. This is your last chance to make sure you've combined things properly, so if you have scenery images labeled as "1" or cat images labeled as "0", go back and figure out where things went wrong now.
# Just run this
fig = plt.figure(figsize=(16, 16))
for i in range(1, 4):
    img = All_X[i - 1]
    title = All_y[i - 1]
    sp = plt.subplot(4, 4, i)
    sp.set_title(title)
    plt.imshow(img, cmap=plt.cm.gray)
fig = plt.figure(figsize=(16, 16))
for i in range(318, 321):
    img = All_X[i - 1]
    title = All_y[i - 1]
    sp = plt.subplot(4, 4, i - 317)
    sp.set_title(title)
    plt.imshow(img, cmap=plt.cm.gray)
fig = plt.figure(figsize=(16, 16))
for i in range(632, 635):
    img = All_X[i - 1]
    title = All_y[i - 1]
    sp = plt.subplot(4, 4, i - 631)
    sp.set_title(title)
    plt.imshow(img, cmap=plt.cm.gray)
One last step before we split these into train and test- we need to flatten them to two-dimensional arrays because that's what our classifiers will accept. This is pretty simple:
# Just run this
All_Flat_X = np.array([image.flatten() for image in All_X])
print(All_Flat_X.shape)
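Flattening just unrolls each 256x256 grid into a single row of 256 * 256 = 65536 values, which is why All_Flat_X comes out two-dimensional. On a toy array:

```python
import numpy as np

img = np.arange(12).reshape(3, 4)   # a toy 3x4 "image"
flat = img.flatten()                # row by row into one long vector

print(img.shape)    # (3, 4)
print(flat.shape)   # (12,): 3 * 4 values in a single row
```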
Now, as we did above, we need to split these final arrays into train, validation, and test sets randomly. Do this below. Let's do 60% of the data into training, 20% into validation, and 20% into testing.
There are a ton of ways to do this, but we're giving you a hint that should lead you toward a very simple way. Note that you NEED to take a random subset here. If you just put the first 60% into train, and the next 40% split between validation and test, your validation and test sets will only have scenery images in them because your data isn't in a random order right now.
2.12 (5 points) Split the data into X and y for Train, Validation, and Test sets
from sklearn.model_selection import train_test_split
# This only splits a numpy array into two random sub-arrays,
# but if you're clever you can also use it to get a validation set with just one more line of code
#google has all the answers: https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn
# Your code goes here
Train_X, Test_X, Train_y, Test_y = train_test_split(All_Flat_X, All_y, test_size=0.2)
# 0.25 of the remaining 80% gives a 20% validation slice: 60/20/20 overall
Train_X, Validation_X, Train_y, Validation_y = train_test_split(Train_X, Train_y, test_size=0.25)
# Print out the following
print (len(Train_X), len(Train_y), len(Test_X), len(Test_y), len(Validation_X), len(Validation_y))
print (Train_y)
print (Validation_y)
print (Test_y)
print (Train_X[0])
Alright, let's test out some classifiers. We'll use the same ones as we did above for the text classification task.
2.13a (8 points) Train models based on each of the four classifiers listed here plus one more you pick. Write down average accuracy and F1 scores again as you did above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Your code goes here
# Uncomment one classifier at a time:
model = MultinomialNB()
#model = LogisticRegression()
#model = RandomForestClassifier()
#model = KNeighborsClassifier()
#model = DecisionTreeClassifier()
accuracies = cross_val_score(model, Validation_X, Validation_y, cv=5, scoring='accuracy')
f1s = cross_val_score(model, Validation_X, Validation_y, cv=5, scoring='f1_weighted')
print(accuracies.mean())
print(f1s.mean())
#Write down the stats the same way as you did above, 3 digits is fine
print("KNeighborsClassifier_AvgAccuracy = 0.6076042820228867")
print("KNeighborsClassifier_AvgF1 = 0.5557744957891693")
print("MultinomialNB_AvgAccuracy = 0.5819490586932448")
print("MultinomialNB_AvgF1 = 0.5792528332656012")
print("LogisticRegression_AvgAccuracy = 0.6144333702473238")
print("LogisticRegression_AvgF1 = 0.5971057649104589")
print("RandomForestClassifier_AvgAccuracy = 0.6543004798818752")
print("RandomForestClassifier_AvgF1 = 0.5917354892114172")
print("YourPick_AvgAccuracy = 0.5743816906607604")
print("YourPick_AvgF1 = 0.5727087335330859")
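Instead of editing the cell once per model, you can loop over all five and collect both metrics in one pass. A sketch on synthetic data (swap in your real `Validation_X` / `Validation_y`; the synthetic set here is just so the snippet runs standalone):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for Validation_X / Validation_y; MultinomialNB
# requires non-negative features, so shift everything up.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = X - X.min()

models = {
    'MultinomialNB': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(random_state=0),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    acc = cross_val_score(model, X, y, scoring='accuracy').mean()
    f1 = cross_val_score(model, X, y, scoring='f1_weighted').mean()
    results[name] = (round(acc, 3), round(f1, 3))
    print(name, results[name])
```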
2.13b (8 points) Pick the best classifier you found and use it on the train and test data below. Print out your accuracy, weighted f1 score, and confusion matrix.
There might not be a clear winner, so don't worry too much about which to pick
# Your code goes here
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
model=RandomForestClassifier()
model.fit(Train_X, Train_y)
predictedvalues=model.predict(Test_X)
print("Accuracy Score: ")
print (accuracy_score(Test_y, predictedvalues))
print ("f1 Score: ")
print (f1_score(Test_y, predictedvalues, average='weighted'))
print ("confusion matrix: ")
print (confusion_matrix(Test_y, predictedvalues))
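If you haven't read a confusion matrix before: rows are the true classes and columns are the predicted classes (in sorted label order), so correct predictions sit on the diagonal. A tiny worked example with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = cats, 0 = scenery
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# Row 0 (true scenery): 1 predicted scenery, 0 predicted cats
# Row 1 (true cats):    1 predicted scenery, 2 predicted cats
```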
2.13c (2 points) What would your accuracy be if you had just picked the majority class from the training set every time? Be careful you know what the majority class is!
# Your code goes here
a = len(Train_y)
b = len(Test_y)
# Labels are 1 for cats and 0 for scenery, so sum() counts the cat images
if sum(Train_y) < (a / 2):
    print("Majority class is scenery")
    print("Accuracy: " + str((b - sum(Test_y)) / b))
elif sum(Train_y) > (a / 2):
    print("Majority class is cats")
    print("Accuracy: " + str(sum(Test_y) / b))
else:
    print("No majority class")
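The same baseline can be computed without the if-chain by taking the most common training label and scoring it against the test labels. A self-contained sketch on toy 0/1 labels (0 = scenery, 1 = cats, same hypothetical encoding as above):

```python
import numpy as np

# Toy stand-ins for Train_y / Test_y
Train_y_toy = np.array([0, 0, 0, 1, 1])
Test_y_toy = np.array([0, 1, 0, 0])

majority = np.bincount(Train_y_toy).argmax()   # most common training label
baseline_acc = np.mean(Test_y_toy == majority) # fraction of test labels it matches
print(majority, baseline_acc)  # 0 0.75
```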
Okay. If your results are anything like ours, your classifier is very slightly better than random at guessing which images have cats in them.
As it turns out, image classification is a really hard problem, and in the process of cleaning this data we've removed a whole lot of information. We took away the color, made the images smaller and changed them away from their natural aspect ratios, and we flattened them out.
We tried very hard in writing this assignment to scaffold creation of a good image classifier from the ground up, but we couldn't find a way to do it that didn't involve a lot of hand-waving and saying "trust us". So we had you make a pretty bad image classifier from the ground up.
There are (for better or worse) lots of premade image classification algorithms that do much better on most datasets. In fact, if you google/youtube around for image classification you'll find almost nobody recommending an approach that involves building a classifier from scratch. Most everybody just says to use TensorFlow. The following section is entirely optional (but don't forget about the third section that follows it, which is NOT optional!). You'll get some bonus points if you do this optional section, and you'll get to see what a really high-accuracy image classifier looks like (kinda). Hopefully you'll also come away with some idea of what TensorFlow is; it's a very widely used package in industry right now.
TensorFlow is an open source library from Google that does a variety of things across data science, including complex classifications. It's hard to find an answer to what exactly TensorFlow does that's comprehensible to a layperson. The closest approximation is that TensorFlow provides a bunch of long premade pipelines for different applications, full of convolutions and other mathematical transformations that make subsequent classification work better.
We're going to quickly run through a TensorFlow image classification model using tf.keras.
This code and the code in the "BONUS_tensorflow-for-poets-2" folder is pulled (slightly modified) from TensorFlow's tutorial here
# Use a terminal/console to navigate to the BONUS_tensorflow-for-poets-2 folder that you downloaded.
# Paste the following into terminal once you're there. (Don't paste the triple quotes or semicolon)
"""
python -m scripts.retrain \
--bottleneck_dir=tf_files/bottlenecks \
--how_many_training_steps=500 \
--model_dir=tf_files/models/ \
--summaries_dir=tf_files/training_summaries/"mobilenet_0.50_224" \
--output_graph=tf_files/retrained_graph.pb \
--output_labels=tf_files/retrained_labels.txt \
--architecture="mobilenet_0.50_224" \
--image_dir=tf_files/catsvsnotcats/
"""
;
BONUS.1 (0.5 point each, 4 points total): Explain what the inputs to retrain.py above do in no more than 15 words for each.
print (len(("Bottlenecks are a feature compression layer of cached data as the final pre-processing step").split()) < 16)
print (len(("Training steps increase/decrease length of time model takes to train - in above case, decreased").split()) < 16)
print (len(("Model_dir is where the model used by tensorflow is stored").split()) < 16)
print (len(("Training summaries, log output from the training, are being placed at this dir location").split()) < 16)
print (len(("The output graph is actual model, including nodes and respective weights, saved to specified dir").split()) < 16)
print (len(("Labels used in the model are placed in the specified dir location").split()) < 16)
print (len(("Model architectures help specify configurations including input image size and feature depth of model").split()) < 16)
print (len(("Image directory specifies data used to retrain the model on your own categories").split()) < 16)
#Keep going up to 8
While the output of the script is whizzing by, you'll see lots of accuracy numbers that are probably very high. MUCH higher than the previous model. Your final accuracy should probably be higher than 95%, if not 100% on the test data.
Let's see how it does on some individual images we held out:
# While in the same folder in the terminal, paste the following script.
# It'll make a prediction for one of the files we held out of the training
"""
python3 -m scripts.label_image \
--graph=tf_files/retrained_graph.pb \
--image=tf_files/heldout/phdcat286.jpg
"""
;
You might get an output that looks something like this:
"cats (score=0.98744) scenery (score=0.01256)"
That basically means that it's really sure that it's a cat, which is correct in this case (which is kind of impressive given how little of the cat is shown in this particular image). You can try it on any of the other images in the heldout folder, and it'll give you similar predictions.
This model is going to be REALLY good at identifying cats vs not cat images from this dataset. How exactly does it work, you ask?
¯\_(ツ)_/¯ No idea. But it works!
Let's try to see if we can fool TensorFlow.
BONUS.2 (1 point or 3 points): Find an image of a cat somewhere online that wasn't included in this dataset. It must have only one cat, and no other animals or humans. All of the cat must be visible such that a human could easily tell that there's a cat in the picture. 1 point: get it so TensorFlow is less than 80% sure that there's a cat in the picture, based on the model you train on the phdcats data. 2 additional points: get it so TensorFlow guesses incorrectly that you have a picture of scenery. Display the image you found below along with the cats and scenery scores you found.
You might have to try a few pictures to get this to work. No cheating by blurring or scribbling over or otherwise screwing with your image is allowed!
"""
python3 -m scripts.label_image \
--graph=tf_files/retrained_graph.pb \
--image=tf_files/fooltfcat2.jpg
"""
#70% sure it's a cat!
tfcatpicture = Image.open('BONUS_tensorflow-for-poets-2/tf_files/fooltfcat.jpg')
plt.imshow(tfcatpicture)
print ("cats (score=0.70448)")
print ("scenery (score=0.29552)")
#100% sure it's scenery!
tfcattoonpicture = Image.open('BONUS_tensorflow-for-poets-2/tf_files/fooltfcat.png')
plt.imshow(tfcattoonpicture)
print ("scenery (score=1.00000)")
print ("cats (score=0.00000)")
This is the final part of the assignment!
Labeled Faces in the Wild (LFW) is a well-known repository of human faces originally based on people who were in the news about a decade ago. The most common face in the database is former US president George W. Bush, who was president at the time.
Through subsequent work, the original authors of the database have attached values for 73 attributes to each image, from "Pale Skin" to "Senior" to "Eyeglasses" to "Male", where a higher value indicates that the image is more representative of this label. The basis of these attributes comes from MTurk worker ratings, but a variety of fancy math led to the values they have now. You can see the full file with attributes for all of the 13,000+ images in the dataset here.
For this activity, we're going to be working off a dataset made entirely of images of George W. Bush. You can find these images in the appropriately-named "George_W_Bush" folder, along with a file called "bushattributes.csv" that shows the attributes of each of the images. If you look in this csv file, you can see that images of Bush are generally labeled with values > 0 in the dataset for Male, White, and not wearing lipstick, and < 0 for Black, Baby, and Heavy Makeup. Seems reasonable, as Bush is White and Male and doesn't usually wear lipstick. For whatever it's worth, the average "Attractive Man" score for Bush is slightly higher than the overall average in the Labeled Faces in the Wild dataset.
As there is variance in the attributes among all of these images (e.g., Bush has a higher "White" score in some images), we can train a classifier on these images and use it to predict attributes of other unseen images. A reasonable-ish (though low-quality) exercise would be to use this to predict attributes of other images in the dataset. An unreasonable thing to do would be to have it predict attributes of images of you, which, of course, is what we're going to do.
Our importing won't be quite as easy as for the cat pictures because we can't just set all attributes to 1 or 0. We need to make sure we're importing in the same order for the images and for the attributes. Note that we've already converted the images to gray for you, though we've kept them at their original dimensions of 250x250. In the original dataset they're 250x250 and color.
3.1 (1 points): Import the images of George W Bush into a numpy array called bushimages.
# Your code goes here
import glob
import PIL
from PIL import Image
import matplotlib.pyplot as plt
bushpics=glob.glob('Part3_George_W_Bush/George_W_Bush/*.jpg')
bushimages=[]
for x in range(0, len(bushpics)):
    bushimages.append(np.array(Image.open(bushpics[x])))
bushimages=np.asarray(bushimages)
#Should return 524
print(len(bushimages))
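One caveat worth knowing before we pair these images with rows of bushattributes.csv: `glob.glob` does not guarantee any particular ordering of the paths it returns, so it's safest to sort them and make sure the csv rows follow the same order. A small self-contained demonstration using throwaway files in a temp directory (the filenames here are just illustrative):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create three dummy files in deliberately scrambled order
    for name in ['George_W_Bush_0003.jpg', 'George_W_Bush_0001.jpg',
                 'George_W_Bush_0002.jpg']:
        open(os.path.join(d, name), 'w').close()
    # sorted() gives a stable lexicographic order regardless of what
    # order the filesystem happens to hand the paths back in
    paths = sorted(glob.glob(os.path.join(d, '*.jpg')))
    names = [os.path.basename(p) for p in paths]
print(names)
```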
As we did a bunch of times above, let's double check what we have in the array.
3.2 (1 points): Plot the first 16 images in your array in a 4x4 grid, and output the numerical contents of the first four of them, just as we did above
# Let's make sure the first 16 images are what we expect them to be:
# Just run this
fig=plt.figure(figsize=(16, 16))
for i in range(1, 17):
    img = bushimages[i-1]
    fig.add_subplot(4, 4, i)
    plt.imshow(img, cmap=plt.cm.gray)
# And the numerical contents of the first four:
for i in range(4):
    print(bushimages[i])
3.3 (1 points): Flatten these images, as we did in the cats/scenery part above.
# Your code goes here
All_Bush_Flat_X = np.array([image.flatten() for image in bushimages])
#print(All_Bush_Flat_X.shape)
# Should print (524, 62500)
print(All_Bush_Flat_X.shape)
Next we need to come up with what we're going to predict. We can do this by importing columns from the bushattributes csv file. If you need some hints, you can look back at A1 where you imported stuff from csv files.
3.4 (1 points): Import all of the attributes from the bushattributes.csv file. Attribute names should be column headers and rows should be the photo the attribute is tied to.
# Your code goes here
import pandas
trainingdata = pandas.read_csv("Part3_George_W_Bush/George_W_Bush/bushattributes.csv", header = 0)
trainingdata = trainingdata.select_dtypes(include='number')  # keep only the numeric attribute columns
trainingdata
The things we want to predict need to be one-dimensional arrays with length equal to the number of photos (524, in this case).
3.5 (1 points): Create separate numpy output arrays for the "Male", "White", "Black", "Asian", "Eyeglasses", and "Mouth Closed" columns. Then, pick any three other attributes you'd like to predict out of the 73 and create output arrays for these too.
# Your code goes here
import numpy as np
maledata=np.asarray(trainingdata['Male'])
whitedata=np.asarray(trainingdata['White'])
blackdata=np.asarray(trainingdata['Black'])
asiandata=np.asarray(trainingdata['Asian'])
eyeglassesdata=np.asarray(trainingdata['Eyeglasses'])
mouthcloseddata=np.asarray(trainingdata['Mouth Closed'])
bangsdata=np.asarray(trainingdata['Bangs'])
indiandata=np.asarray(trainingdata['Indian'])
attractmandata=np.asarray(trainingdata['Attractive Man'])
We took care of the validation etc. testing for you this time. Results: all the models are pretty bad. But that's okay, we've got enough to pitch to get VC dollars. (That was a joke. Mostly).
Go ahead and use ExtraTreesRegressor as a classifier. It'll take a while to run, but it's among the least terrible.
3.6 (4 points): Put five images of yourself in the "You" folder and convert them to grayscale and 250x250 (you can use whatever images you like, you won't need to submit them to us so you'll be the only one to see them). Then import them and flatten them and use them as a set to make predictions on using this classifier. Run the classifier nine times - once to predict each of the six attributes we specified above for all five of your photos, and once to predict each of the three attributes that you chose for all five of your photos.
Save the numbers you get each time somewhere, but please don't copy/paste the code into nine successive cells- just keep it in one cell and edit the predicted array each time. It'll make this much easier for us to read.
Grab a drink or find something good on Netflix - these models take a while to run. On the three-year-old PowerBook that this is being written on, it's taking about two minutes per model.
# Your code goes here
picsofme=glob.glob('Part3_George_W_Bush/You/*.jpg')
for pic in range(0, len(picsofme)):
    im = Image.open(picsofme[pic])
    # .save() returns None, so there's no point reassigning im to it
    im.convert('L').resize((250, 250)).save('Part3_George_W_Bush/You/Neha' + str(pic) + '.jpg')
renamedpicsofme=glob.glob('Part3_George_W_Bush/You/Neha*.jpg')
print (renamedpicsofme)
youimages=[]
for x in range(0, len(renamedpicsofme)):
    youimages.append(np.array(Image.open(renamedpicsofme[x])))
youimages=np.asarray(youimages)
All_You_Flat_X = np.array([image.flatten() for image in youimages])
#print(All_You_Flat_X.shape)
#train on attractiveman or other attribrutes (swapping out Train_y each time)
from sklearn.ensemble import ExtraTreesRegressor
model=ExtraTreesRegressor()
model.fit(All_Bush_Flat_X, attractmandata)
predictedvalues=model.predict(All_You_Flat_X)
print(predictedvalues)
Okay. With that all done:
3.7 (0.33 points each): On a scale of 1-5 for each, how well did the classifier classify your images for each of the above attributes?
#Fill in below:
"Male: 2 - I'm slightly less male than GWB (averaging across both of our 'male' column values). The classifier does think I'm male, though. To be fair, he has somewhat feminine features."
"White: 1 - I'm slightly less white than GWB (averaging across both of our 'white' column values), which just seems wrong. The classifier thinks I'm white."
"Black: 2 - GWB is slightly less black than me (averaging across both of our 'black' column values). The classifier is pretty sure I'm not black. It makes sense based on the values I received for white, but again, this is pretty wrong."
"Asian: 1 - The classifier does not think I'm Asian. Apparently I'm less Asian than GWB (averaging across both of our 'asian' column values). If the definition of the Asian column does not include South Asian, maybe that's true. But the Asian feature definition should include South Asian."
"Eyeglasses: 5- The classifier correctly knew I wasn't wearing eyeglasses. Good job classifier!"
"Mouth closed: 1- Two of my five pictures have my mouth closed. The classifier guessed that in one picture where my mouth is open that it is closed."
"Bangs: 1- I included one picture where I had bangs. It guessed that none of my pictures included bangs."
"Indian: 1- Not sure how Indian is defined here, but if we're including South Asian Indian, the classifier is pretty sure I'm not."
"Attractive Man: 1 - how could I be a less attractive man than George? There is no hope."
;
A model trained exclusively on pictures of George W. Bush is clearly going to be a biased dataset, as the data is not at all representative of... anybody other than George W. Bush, really.
As noted above, the broader dataset here comes from photos of people in the news in the mid-2000s, and George W. Bush is the most-represented person in this dataset.
3.8 (3 points) Where (presumably other than news articles from the mid-2000s) would you scrape faces from to get a dataset that you think would be quite diverse but also that you'd be confident faces like yours would be represented in? Why did you pick this dataset? (no more than 50 words).
print (len(("I would select the IMDB actor and actress database, which usually includes images of each actor/actress. This is a global dataset, so it would include Indian people (though there is a bias toward lighter-skinned Indians in acting).").split()) < 51)
3.9 (2 points each, 10 total): Assuming this classifier achieved significantly better scores for model quality than it currently does, which of the following would you be comfortable using it for? Why or why not? (30 words or less for each).
# Picking which photos of you are most attractive to use on Tinder/Grindr/Bumble/Coffee Meets Bagel/OkCupid...
print (len(("No, because there isn't a single objective definition of attractiveness - all include bias").split()) < 31)
# Automating your Tinder swiping by picking people who are above a certain attractiveness threshold.
# (This has actually been done, though not with the George W Bush dataset)
print (len(("No because I couldn't tune the algo to my preferences for attractiveness").split()) < 31)
# Automatically detecting race in photos of college applicants to verify what they enter.
print (len(("No - there is risk of incorrect identification and a mistake by the classifier could result in someone not getting into college").split()) < 31)
# Detecting people who are wearing glasses in their Facebook photos in order to target glasses ads to them
print (len(("I would actually be OK with this because I already see ads on Facebook. There's no significant cost to the user for getting the classification wrong.").split()) < 31)
# Looking at professional photos of speakers at an event to count how many speakers are female
print (len(("No - I question why this task is even necessary. You could get that information from self-identification by speakers and that runs no risk of incorrect classification.").split()) < 31)
3.10 (BONUS, up to 5 points): Go download the full Labeled Faces in the Wild attributes csv, the one with 13,000+ rows. You DO NOT have to download all the photos. For all nine of the attributes you've been using above, find the average value across the whole dataset. Next, go skim the paper where they talk about how the attribute values were generated. What do each of these attributes mean? (no word limit, but be kind to us)
# Your bonus response here
Once you've completed all of the above, you're done with assignment 3! You might want to double check that your code works like you expect. You can do this by choosing "Restart & Run All" in the Kernel menu. If it outputs errors, you may want to go back and check what you've done.
Once you think everything is set, please run ALL of your code, download your final notebook as HTML, and submit to the A3 folder on the Canvas site with name [yourandrewid]_haiif18a[assignmentnumber], e.g., jseering_haiif18a3.